Abstract

In this paper, the high-dimensional sparse linear regression model is considered,
where the overall number of variables is larger than the number of observations. We
investigate the $L_1$ penalized least absolute deviation (LAD) method. Unlike most
other methods, the $L_1$ penalized LAD method does not need any knowledge of the standard
deviation of the noise or any moment assumptions on the noise. Our analysis shows
that the method achieves near oracle performance, i.e. with large probability, the $L_2$
norm of the estimation error is of order $O(\sqrt{k \log p / n})$. The result is true for a wide
range of noise distributions, even for the Cauchy distribution. Numerical results are
also presented.
1 Introduction

The high dimensional linear regression model, where the number of observations is much less
than the number of unknown coefficients, has attracted much recent interest in a number
of fields such as applied mathematics, electronic engineering, and statistics. In this paper, we
consider the following classical high dimensional linear model:

$$Y = X\beta + z, \qquad (1)$$

where $Y = (y_1, y_2, \ldots, y_n)'$ is the $n$ dimensional vector of outcomes, $X$ is the $n \times p$ design
matrix, and $z = (z_1, z_2, \ldots, z_n)'$ is the $n$ dimensional vector of measurement errors (or
noises). We assume $X = (X_1, X_2, \ldots, X_p)$ where $X_i \in R^n$ denotes the $i$th regressor or
variable. Throughout, we assume that each vector $X_i$ is normalized such that $\|X_i\|_2^2 = n$
for $i = 1, 2, \ldots, p$. We will focus on the high dimensional case where $p > n$ and our goal is
to reconstruct the unknown vector $\beta \in R^p$.

Since we are considering a high dimensional linear regression problem, a key assumption
is the sparsity of the true coefficient $\beta$. Here we assume,

$$T = \mathrm{supp}(\beta) \text{ has } k < n \text{ elements.}$$

The set $T$ of nonzero coefficients or significant variables is unknown. In what follows, the
true parameter value $\beta$ and $p$ and $k$ are implicitly indexed by the sample size $n$, but we
omit the index in our notation whenever this does not cause confusion.

The ordinary least squares method is not consistent in the setting $p > n$. In recent years,
many new methods have been proposed to solve the high dimensional linear regression
problem. Methods based on $L_1$ penalization or constrained $L_1$ minimization have been
extensively studied. The Dantzig selector was proposed in [9], which can be written as

$$\hat\beta_{DS} = \arg\min_{\gamma \in R^p} \|\gamma\|_1, \quad \text{subject to } \|X'(Y - X\gamma)\|_\infty \le c\sigma\sqrt{2n\log p},$$

for some constant $c > 1$. It is clear that the Dantzig selector depends on the standard
deviation of the noise and the Gaussian assumption. General constrained $L_1$ minimization
methods for the noiseless case and Gaussian noise were studied in [6]. More results about
constrained $L_1$ minimization can be found in, for example, [8], [11], [7] and the references
therein.

Besides the constrained minimization methods, the lasso ($L_1$ penalized least squares)
type methods have been studied in a number of papers, for example, [19], [3], and [16]. The
classical lasso estimator can be written as

$$\hat\beta_{lasso} = \arg\min_{\gamma} \|Y - X\gamma\|_2^2 + \lambda\|\gamma\|_1,$$

where $\lambda$ is the penalty level (tuning parameter). In the setting of Gaussian noise and known
variance, it is suggested in [3] that the penalty could be

$$\lambda = 2c\sigma\sqrt{n}\,\Phi^{-1}(1 - \alpha/2p),$$

where $c > 1$ is a constant and $\alpha$ is a small chosen probability. By using this penalty value, it
was shown that the lasso estimator can achieve near oracle performance, i.e.
$\|\hat\beta_{lasso} - \beta\|_2 \le C\sigma(k\log(2p/\alpha)/n)^{1/2}$ for some constant $C > 0$ with probability at least $1 - \alpha$.

The lasso method has nice properties, but it also relies heavily on the Gaussian
assumption and a known variance. In practice, the Gaussian assumption may not hold and
the estimation of the standard deviation $\sigma$ is not a trivial problem. In a recent paper, [2]
proposed the square-root lasso method, where knowledge of the distribution or variance
is not required. Instead, some moment assumptions on the errors and design matrix are
needed. Other than the constrained optimization or penalized optimization methods,
stepwise algorithms have also been studied, see for example [21] and [5]. It is worth noting that to
properly apply the stepwise methods, we also need assumptions on the noise structure or
the standard deviation of the noise.
It is now seen that for most of the proposed methods, the noise structure plays an
important role in the estimation of the unknown coefficients. In most of the existing
literature, either an assumption on the error distribution or a known variance is required.
Unfortunately, in the high dimensional setup, these assumptions are not always true.
Moreover, in cases where heavy-tailed errors or outliers are found in the response, the variance
of the errors may be unbounded. Hence the above methods cannot be applied.

To deal with the cases where the error distribution is unknown or may have heavy tails,
we propose the following $L_1$ penalized least absolute deviation ($L_1$ PLAD) estimator,

$$\hat\beta \in \arg\min_{\gamma} \{\|Y - X\gamma\|_1 + \lambda\|\gamma\|_1\}. \qquad (2)$$

The least absolute deviation (LAD) type of methods are important when heavy-tailed
errors are present. These methods have desirable robustness properties in linear regression models,
see for example [1] and [17]. Recently, the penalized version of the LAD method
was studied. Variable selection properties and consistency of the $L_1$ penalized LAD were
discussed in, for example, [20], [12], and [15].

In this paper, we present analysis for the $L_1$ PLAD method and we discuss the selection
of the penalty level, which does not depend on any unknown parameters or the noise
distribution. Our analysis shows that the $L_1$ PLAD method has surprisingly good properties. The
main contribution of the present paper is twofold. (1) We propose a rule for setting the
penalty level; it is simply

$$\lambda = c\sqrt{2A(\alpha)n\log p},$$

where $c > 1$ is a constant, $\alpha$ is a chosen small probability, and $A(\alpha)$ is a constant such
that $2p^{-(A(\alpha)-1)} \le \alpha$. In practice, we suggest taking $c = 1.1$, or we can simply choose
$\lambda = \sqrt{2n\log p}$; see the numerical study section for more discussion. This choice of penalty
is universal and we only assume that the noises have median 0. (2) We show that with high
probability, the estimator has near oracle performance, i.e. with high probability

$$\|\hat\beta - \beta\|_2 = O\left(\sqrt{\frac{k\log p}{n}}\right).$$

It is important to notice that we do not make any assumptions on the distribution or
moments of the noise. In fact, even for Cauchy distributed noise, where the first order
moment does not exist, our results still hold.

Importantly, the problem retains global convexity, making the method computationally
efficient. In fact, we can use an ordinary LAD package to compute the $L_1$ penalized
LAD estimator. This is because we can treat the penalty terms as new observations, i.e.
$Y_{n+i} = 0$ and $x_{n+i,j} = \lambda \times I(j = i)$ for $i, j = 1, 2, \ldots, p$. Then our $L_1$ penalized estimator
can be considered as an ordinary LAD estimator with $p$ unknown coefficients and $p + n$
observations. Hence it can be solved efficiently.
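
The augmentation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `augment_for_lad`, `lad_objective` and `penalized_objective` are hypothetical and the data are toy values. It checks that the plain LAD objective on the $p + n$ augmented observations equals the penalized objective on the original data, which is why an ordinary LAD solver can be reused.

```python
def augment_for_lad(X, Y, lam):
    """Append p pseudo-observations: response 0 and design row lam * e_i."""
    p = len(X[0])
    X_aug = [list(row) for row in X]
    Y_aug = list(Y)
    for i in range(p):
        X_aug.append([lam if j == i else 0.0 for j in range(p)])
        Y_aug.append(0.0)
    return X_aug, Y_aug


def lad_objective(X, Y, gamma):
    """Sum of absolute residuals ||Y - X gamma||_1."""
    return sum(
        abs(y - sum(x_j * g_j for x_j, g_j in zip(row, gamma)))
        for row, y in zip(X, Y)
    )


def penalized_objective(X, Y, gamma, lam):
    """||Y - X gamma||_1 + lam * ||gamma||_1."""
    return lad_objective(X, Y, gamma) + lam * sum(abs(g) for g in gamma)


# Toy data (n = 2, p = 3), chosen only for illustration.
X = [[1.0, 2.0, 0.5], [0.0, -1.0, 3.0]]
Y = [1.5, -2.0]
lam = 0.7
gamma = [0.3, -1.2, 0.0]

X_aug, Y_aug = augment_for_lad(X, Y, lam)
plain_on_augmented = lad_objective(X_aug, Y_aug, gamma)
penalized_on_original = penalized_objective(X, Y, gamma, lam)
```

Any unpenalized LAD (median regression) routine applied to `X_aug` and `Y_aug`, without an intercept, would then minimize the penalized criterion.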

The rest of the paper is organized as follows. Section 2 discusses the choice of penalty
level. In Section 3, the main results about the estimation error and several critical lemmas
are presented. We also briefly explain the main idea of the proofs. Section 4 presents
the simulation study results, which show that the $L_1$ penalized LAD method has very good
numerical performance regardless of the noise distribution. Technical lemmas and the proofs
of theorems are given in Section 5.

2 Choice of Penalty 

In this section, we discuss the choice of the penalty level for the $L_1$ PLAD estimator. For
any $\gamma \in R^p$, let $Q(\gamma) = \|Y - X\gamma\|_1$. Then the $L_1$ PLAD estimator can be written as

$$\hat\beta \in \arg\min_{\gamma} \{Q(\gamma) + \lambda\|\gamma\|_1\}.$$

An important quantity to determine the penalty level is the sub-differential of $Q$ evaluated
at the point of the true coefficient $\beta$. Recall that the measurement errors $z_i$ follow some
continuous distribution with median 0. Assume that $z_i \neq 0$ for all $i$, then the sub-differential
of $Q(\gamma) = \|Y - X\gamma\|_1$ at the point $\gamma = \beta$ can be written as

$$S = X'(\mathrm{sign}(z_1), \mathrm{sign}(z_2), \ldots, \mathrm{sign}(z_n))',$$

where $\mathrm{sign}(x)$ denotes the sign of $x$, i.e. $\mathrm{sign}(x) = 1$ if $x > 0$, $\mathrm{sign}(x) = -1$ if $x < 0$, and
$\mathrm{sign}(0) = 0$. Let $I = \mathrm{sign}(z)$, then $I = (I_1, I_2, \ldots, I_n)'$ where $I_i = \mathrm{sign}(z_i)$. Since the $z_i$'s are
independent and have median 0, we know that $P(I_i = 1) = P(I_i = -1) = 0.5$ and the $I_i$ are
independent.

The sub-differential of $Q(\gamma)$ at the point $\beta$, $S = X'I$, summarizes the estimation error
in the setting of the linear regression model. We will choose a penalty $\lambda$ that dominates the
estimation error with large probability. This principle of selecting the penalty $\lambda$ is motivated
by [3] and [2]. It is worth noting that this is a general principle of choosing the penalty
and can be applied to many other problems. To be more specific, we will choose a penalty
$\lambda$ such that it is greater than the maximum absolute value of $S$ with high probability, i.e.
we need to find a penalty level $\lambda$ such that

$$P(\lambda \ge c\|S\|_\infty) \ge 1 - \alpha, \qquad (3)$$

for a given constant $c > 1$ and a given small probability $\alpha$. Note that $c$ is a theoretical
constant and in practice we can simply take $c = 1.1$. Since the distribution of $I$ is known,
the distribution of $\|S\|_\infty$ is known for any given $X$ and does not depend on any unknown
parameters.

Now for any random variable $W$ let $q_\alpha(W)$ denote the $1 - \alpha$ quantile of $W$. Then
in theory, $q_\alpha(\|S\|_\infty)$ is known for any given $X$. Therefore if we choose $\lambda = c\,q_\alpha(\|S\|_\infty)$,
inequality (3) is satisfied.

In practice, it might be hard to calculate the exact quantile $q_\alpha(\|S\|_\infty)$ for a given $X$. One
possible way to calculate or approximate it is by simulation, but this will cause additional
computation time. Here we propose the following asymptotic choice of penalty,

$$\lambda = c\sqrt{2A(\alpha)n\log p}, \qquad (4)$$

where $A(\alpha) > 0$ is a constant such that $2p^{-(A(\alpha)-1)} \le \alpha$.

To show that the above choice of penalty satisfies (3), we need to bound the tail
probability of $X_i'I$ for $i = 1, 2, \ldots, p$. This can be done by using Hoeffding's inequality,
see for example [13], and union bounds. We have the following lemma.

Lemma 1 The choice of penalty $\lambda = c\sqrt{2A(\alpha)n\log p}$ as in (4) satisfies

$$P(\lambda \ge c\|S\|_\infty) \ge 1 - \alpha.$$

From the proof of the previous lemma, we can see that if we use the following special choice
of $\lambda$,

$$\lambda = 2c\sqrt{n\log p}, \qquad (5)$$

then we have that

$$P(\lambda \ge c\|S\|_\infty) \ge 1 - \frac{2}{p}. \qquad (6)$$
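
As a rough numerical illustration of the penalty rules (4) and (5), the following Python sketch computes the smallest admissible $A(\alpha)$, namely $A(\alpha) = 1 + \log(2/\alpha)/\log p$, and estimates $P(\lambda \ge c\|S\|_\infty)$ by simulating random sign vectors. The function names, the Gaussian design, and the small problem size ($n = 50$, $p = 100$) are assumptions made for this example, not part of the paper's simulation study.

```python
import math
import random


def penalty_asymptotic(n, p, alpha, c=1.1):
    """Penalty (4) with the smallest A(alpha) such that 2 * p**(-(A - 1)) <= alpha."""
    A = 1.0 + math.log(2.0 / alpha) / math.log(p)
    return c * math.sqrt(2.0 * A * n * math.log(p)), A


def penalty_simple(n, p, c=1.1):
    """Penalty (5), which corresponds to A = 2."""
    return 2.0 * c * math.sqrt(n * math.log(p))


def normalized_columns(n, p, rng):
    """p Gaussian columns, each rescaled so that its squared L2 norm equals n."""
    cols = []
    for _ in range(p):
        col = [rng.gauss(0.0, 1.0) for _ in range(n)]
        scale = math.sqrt(n / sum(v * v for v in col))
        cols.append([v * scale for v in col])
    return cols


def sup_norm_S(cols, rng):
    """One draw of ||X' I||_inf with I a vector of independent random signs."""
    signs = [rng.choice((-1.0, 1.0)) for _ in range(len(cols[0]))]
    return max(abs(sum(x * s for x, s in zip(col, signs))) for col in cols)


rng = random.Random(0)
n, p, alpha, c = 50, 100, 0.05, 1.1
cols = normalized_columns(n, p, rng)
lam, A = penalty_asymptotic(n, p, alpha, c)
draws = [sup_norm_S(cols, rng) for _ in range(200)]
coverage = sum(lam >= c * s for s in draws) / len(draws)
```

Since the Hoeffding bound is conservative, the empirical `coverage` is expected to be well above $1 - \alpha$; the same draws could also be used to approximate the quantile $q_\alpha(\|S\|_\infty)$ directly.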

The above penalties are simple and have good theoretical properties. Moreover, they
do not require any conditions on the matrix $X$ or the values of $p$ and $n$. But in practice, since
the bounds here are not very tight, these penalty levels tend to be relatively large and
can cause additional bias in the estimator. It is worth pointing out that if there exists an
$i \in \{1, 2, \ldots, p\}$ such that $\|X_i\|_1 < \lambda$, then $\hat\beta_i$ must be 0. Otherwise we can replace $\hat\beta_i$ by
0, and the value of $Q(\hat\beta) + \lambda\|\hat\beta\|_1$ will reduce by at least $(\lambda - \|X_i\|_1)|\hat\beta_i|$. This means that if the
penalty level $\lambda$ is too large, the $L_1$ PLAD method may kill some of the significant variables.
To deal with this issue, we propose the following refined asymptotic choice of penalty level,
provided some moment conditions on the design matrix $X$ hold.
Lemma 2 Suppose

$$B = \sup_n \sup_{1 \le j \le p} \frac{1}{n}\|X_j\|_q^q < \infty, \qquad (7)$$

for some constant $q > 2$. Assume $\Phi^{-1}(1 - \alpha/2p) \le (q - 2)\sqrt{\log n}$. Then the choice of
penalty $\lambda = c\sqrt{n}\,\Phi^{-1}(1 - \frac{\alpha}{2p})$ satisfies

$$P(\lambda \ge c\|S\|_\infty) \ge 1 - \alpha(1 + \omega_n),$$

where $\omega_n$ goes to 0 as $n$ goes to infinity.

This choice of penalty relies on moment conditions of $X$ and the relative size of $p$ and $n$,
but it could be smaller than the previous ones and in practice it will cause less bias. We
investigate the effect of different penalties in the numerical study section.
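
To give a feel for how much smaller the refined penalty of Lemma 2 can be, here is a small Python sketch comparing it with the penalty (4), using the standard library's `statistics.NormalDist` for $\Phi^{-1}$. The function names and the values $n = 200$, $p = 400$, $\alpha = 0.05$, $c = 1.1$ are illustrative assumptions.

```python
import math
from statistics import NormalDist


def penalty_refined(n, p, alpha, c=1.1):
    """Lemma 2 penalty: c * sqrt(n) * Phi^{-1}(1 - alpha / (2p))."""
    return c * math.sqrt(n) * NormalDist().inv_cdf(1.0 - alpha / (2.0 * p))


def penalty_asymptotic(n, p, alpha, c=1.1):
    """Penalty (4) with the smallest A(alpha) such that 2 * p**(-(A - 1)) <= alpha."""
    A = 1.0 + math.log(2.0 / alpha) / math.log(p)
    return c * math.sqrt(2.0 * A * n * math.log(p))


n, p, alpha = 200, 400, 0.05
lam_refined = penalty_refined(n, p, alpha)
lam_asymptotic = penalty_asymptotic(n, p, alpha)
```

With this choice of $A(\alpha)$ we have $2A(\alpha)\log p = 2\log(2p/\alpha)$, so the comparison reduces to the Gaussian tail inequality $\Phi^{-1}(1 - \delta) \le \sqrt{2\log(1/\delta)}$ with $\delta = \alpha/(2p)$; the refined penalty is therefore never larger in this sketch.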

To simplify our arguments, in the following theoretical discussion we will use (5) as the
default choice of penalty. It can be seen that the above choices of penalty levels do not
depend on the distribution of the measurement errors $z_i$ or the unknown coefficient $\beta$. As long as
the $z_i$'s are independent continuous random variables with median 0, the choices satisfy our
requirement. This is a big advantage over the traditional lasso method, which relies
heavily on the Gaussian assumption and the variance of the errors.

3 Properties of the Estimator 

In this section, we present the properties of the $L_1$ PLAD estimator. We shall state the
upper bound for the estimation error $h = \beta - \hat\beta$ under the $L_2$ norm $\|h\|_2$. We shall also present
the variable selection properties for both noisy and noiseless cases. The choice of penalty
is described in the previous section. Throughout the discussion in this section, we assume
the penalty $\lambda$ satisfies $\lambda \ge c\|S\|_\infty$ for some fixed constant $c > 1$. In what follows, for any
set $E \subset \{1, 2, \ldots, p\}$ and vector $h \in R^p$, let $h_E = hI(E)$ denote the $p$ dimensional vector
such that we only keep the coordinates of $h$ when their indexes are in $E$ and replace others
by 0.



3.1 Conditions on design matrix X 

We will first introduce some conditions on the design matrix $X$. Recall that we assume
$\lambda \ge c\|S\|_\infty$; this implies the following event, namely $h = \beta - \hat\beta$ belongs to the restricted set $\Delta_C$,
where

$$\Delta_C = \{\delta \in R^p : \|\delta_T\|_1 \ge C\|\delta_{T^c}\|_1, \text{ where } T \subset \{1, 2, \ldots, p\} \text{ and } T \text{ contains at most } k \text{ elements}\},$$

and $C = (c - 1)/(c + 1)$. To show this important property of the $L_1$ PLAD estimator,
recall that $\hat\beta$ minimizes $\|X\gamma - Y\|_1 + \lambda\|\gamma\|_1$. Hence

$$\|Xh + z\|_1 + \lambda\|\hat\beta\|_1 \le \|z\|_1 + \lambda\|\beta\|_1.$$

Let $T$ denote the set of significant coefficients. Then

$$\|Xh + z\|_1 - \|z\|_1 \le \lambda(\|h_T\|_1 - \|h_{T^c}\|_1). \qquad (8)$$

Since the sub-differential of $Q(\gamma)$ at the point $\beta$ is $X'I$, where $I = \mathrm{sign}(z)$,

$$\|Xh + z\|_1 - \|z\|_1 \ge (Xh)'I = h'X'I \ge -\|h\|_1\|X'I\|_\infty \ge -\frac{\lambda}{c}(\|h_T\|_1 + \|h_{T^c}\|_1).$$

So

$$\|h_T\|_1 \ge C\|h_{T^c}\|_1, \qquad (9)$$

where $C = (c - 1)/(c + 1)$.

The fact that $h \in \Delta_C$ is extremely important for our arguments. This fact is also
important for the arguments of the classical lasso method and the square-root lasso method,
see for example, [3] and [2].
Now we shall define some important quantities of the design matrix $X$. Let $\lambda_k^u$ be the
smallest number such that for any $k$ sparse vector $d \in R^p$,

$$\|Xd\|_2^2 \le n\lambda_k^u\|d\|_2^2.$$

Here $k$ sparse vector $d$ means that the vector $d$ has at most $k$ nonzero coordinates, or
$\|d\|_0 \le k$. Similarly, let $\lambda_k^l$ be the largest number such that for any $k$ sparse vector $d \in R^p$,

$$\|Xd\|_2^2 \ge n\lambda_k^l\|d\|_2^2.$$

The above constants are essentially the Restricted Isometry Constants,
see for example [10], but we use different notations for upper and lower bounds. We also
need to define the following restricted eigenvalues of the design matrix $X$. These definitions
are based on the idea of [3]. Let

$$\kappa_k^l(C) = \min_{h \in \Delta_C} \frac{\|Xh\|_1}{n\|h_T\|_2} \quad \text{and} \quad \eta_k^l(C) = \min_{h \in \Delta_C} \frac{\|Xh\|_2}{\sqrt{n}\|h_T\|_2}.$$

To show the properties of the $L_1$ penalized LAD estimator, we need both $\kappa_k^l(C)$ and
$\eta_k^l(C)$ to be bounded away from 0. To simplify the notation, when it does not cause any
confusion, we will simply write $\kappa_k^l(C)$ as $\kappa_k^l$, and $\eta_k^l(C)$ as $\eta_k^l$.

3.2 Important Lemmas 

Before presenting the main theorem, we first state a few critical lemmas. From (8), we
know that

$$\|Xh + z\|_1 - \|z\|_1 \le \lambda\|h_T\|_1.$$

To bound the estimation error, we shall first investigate the random variable
$\frac{1}{\sqrt{n}}(\|Xh + z\|_1 - \|z\|_1)$. For any vector $d \in R^p$, let

$$B(d) = \frac{1}{\sqrt{n}}\left|(\|Xd + z\|_1 - \|z\|_1) - E(\|Xd + z\|_1 - \|z\|_1)\right|.$$

We introduce the following important result.

Lemma 3 Suppose the $z_i$'s are independent random variables. Assume $p \ge n$ and $p \ge 3\kappa_k^u$,
then

$$P\left(\sup_{\|d\|_0 = k, \|d\|_2 = 1} B(d) \ge (1 + 2C_1\sqrt{\lambda_k^u})\sqrt{2k\log p}\right) \le 2p^{-4k(C_1^2 - 1)}, \qquad (10)$$

where $C_1 > 1$ is a constant.

From the above lemma, we know that with probability at least $1 - 2p^{-4k(C_1^2 - 1)}$, for any
$k$ sparse vector $d \in R^p$,

$$\frac{1}{\sqrt{n}}(\|Xd + z\|_1 - \|z\|_1) \ge \frac{1}{\sqrt{n}}E(\|Xd + z\|_1 - \|z\|_1) - C\sqrt{2k\log p}\,\|d\|_2, \qquad (11)$$

where $C = 1 + 2C_1\sqrt{\lambda_k^u}$. This lemma shows that with high probability, the value of the
random variable $\frac{1}{\sqrt{n}}(\|Xd + z\|_1 - \|z\|_1)$ is very close to its expectation. Since the expectation
is fixed and much easier to analyze than the random variable itself, this lemma plays an
important role in our proof of the main theorem.

Next, we will investigate the properties of $E(\|Xd + z\|_1 - \|z\|_1)$. We have the following
lemmas.

Lemma 4 For any continuous random variable $z_i$, we have that

$$\frac{dE(|z_i + x| - |z_i|)}{dx} = 1 - 2P(z_i \le -x).$$

Now we will introduce the scale assumptions on the measurement errors $z_i$. Suppose
there exists a constant $a > 0$ such that

$$P(z_i \ge x) \le \frac{1}{2 + ax} \text{ for all } x \ge 0, \qquad P(z_i \le x) \le \frac{1}{2 + a|x|} \text{ for all } x < 0. \qquad (12)$$

Here $a$ serves as a scale parameter of the distribution of $z_i$. This is a very weak condition
and even the Cauchy distribution satisfies it. Based on this assumption, we have that for any
$c > 0$,

$$E(|z_i + c| - |z_i|) = c - 2\int_0^c P(z_i < -x)\,dx \ge c - 2\int_0^c \frac{1}{2 + ax}\,dx = c - \frac{2}{a}\log\left(1 + \frac{a}{2}c\right).$$

Hence we have the following lemma.

Lemma 5 Suppose the random variable $z_i$ satisfies condition (12), then

$$E(|z_i + c| - |z_i|) \ge \frac{a}{16}|c|\left(|c| \wedge \frac{6}{a}\right). \qquad (13)$$

Remark 1 This is just a weak bound and can be improved easily. But for simplicity, we
use this one in our discussion.
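
The bound (13) can be checked numerically for standard Cauchy noise, for which $P(z \ge x) = 1/2 - \arctan(x)/\pi$ and, integrating the derivative in Lemma 4, $E(|z + c| - |z|) = \frac{2}{\pi}\left(c\arctan c - \frac{1}{2}\log(1 + c^2)\right)$ for $c \ge 0$. The sketch below is an illustration only: the scale constant $a = 1$ is an assumed value for which condition (12) holds on the grid, and the function names are hypothetical.

```python
import math


def cauchy_upper_tail(x):
    """P(z >= x) for a standard Cauchy random variable z."""
    return 0.5 - math.atan(x) / math.pi


def expected_shift(c):
    """E(|z + c| - |z|) for standard Cauchy z (closed form, symmetric in c)."""
    c = abs(c)
    return (2.0 / math.pi) * (c * math.atan(c) - 0.5 * math.log1p(c * c))


def lemma5_bound(c, a):
    """Right-hand side of (13): (a / 16) * |c| * min(|c|, 6 / a)."""
    return (a / 16.0) * abs(c) * min(abs(c), 6.0 / a)


a = 1.0  # assumed scale constant for the standard Cauchy distribution
grid = [0.01 * j for j in range(1, 20001)]  # points in (0, 200]

# Condition (12) on the grid (the case x < 0 follows by symmetry).
condition_holds = all(cauchy_upper_tail(x) <= 1.0 / (2.0 + a * x) for x in grid)
# Lemma 5 lower bound on the same grid.
bound_holds = all(expected_shift(c) >= lemma5_bound(c, a) for c in grid)
```

Both checks pass with a comfortable margin on this grid, consistent with Remark 1 that the bound is not tight.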

3.3 Main Theorem 

Now we shall state our main result. Here we assume that the measurement errors $z_i$ are
independent and identically distributed random variables with median 0. We also assume
that the $z_i$'s satisfy condition (12). Moreover, we assume $\eta_k^l > 0$, $\kappa_k^l > 0$ and

$$\frac{3}{16}\sqrt{n}\,\kappa_k^l > \lambda\sqrt{k/n} + C_1\sqrt{2k\log p}\left(1.25 + \frac{1}{C}\right), \qquad (14)$$

for some constant $C_1$ such that $C_1 > 1 + 2\sqrt{\lambda_k^u}$. We have the following theorem.

Theorem 1 Under the previous assumptions, the $L_1$ penalized LAD estimator $\hat\beta$ satisfies
with probability at least $1 - 2p^{-4k(C_2^2 - 1) + 1}$

$$\|\hat\beta - \beta\|_2 \le \sqrt{\frac{2k\log p}{n}}\;\frac{16(c\sqrt{2} + 1.25C_1 + C_1/C)}{a\,\eta_k^l}\sqrt{1 + \frac{1}{C}},$$

where $C_1 = 1 + 2C_2\sqrt{\lambda_k^u}$ and $C_2 > 1$ is a constant.

Remark 2 From the proof of the theorem, we can see that the identically distributed
assumption on the measurement errors is not essential. We just need that there exists a constant
$a > 0$ such that for all $i$, $P(z_i \ge x) \le \frac{1}{2 + ax}$ for $x \ge 0$ and $P(z_i \le x) \le \frac{1}{2 + a|x|}$ for $x < 0$.
This is also verified in the simulation study section.

From the theorem we can easily see that asymptotically, with high probability,

$$\|\hat\beta - \beta\|_2 = O\left(\sqrt{\frac{k\log p}{n}}\right). \qquad (15)$$

This means that asymptotically, the $L_1$ PLAD estimator has near oracle performance and
hence it matches the asymptotic performance of the lasso method with known variance.

A simple consequence of the main theorem is that the $L_1$ PLAD estimator will select
most of the significant variables with high probability. We have the following theorem.

Theorem 2 Let $\hat{T} = \mathrm{supp}(\hat\beta)$ be the estimated support of the coefficients. Then under
the same conditions as in Theorem 1, with probability at least $1 - 2p^{-4k(C_2^2 - 1) + 1}$,

$$\left\{i : |\beta_i| \ge \sqrt{\frac{2k\log p}{n}}\;\frac{16(c\sqrt{2} + 1.25C_1 + C_1/C)}{a\,\eta_k^l}\right\} \subset \hat{T}, \qquad (16)$$

where $C_1 = 1 + 2C_2\sqrt{\lambda_k^u}$ and $C_2 > 1$ is a constant.

Remark 3 This theorem shows that the $L_1$ PLAD method will select a model that contains
all the variables with large coefficients. If in the main model, all the nonzero coefficients
are large enough in terms of absolute value, then the $L_1$ PLAD method can select all of
them into the model.

A special but important case in high dimensional linear regression is the noiseless case.
The next theorem shows that the $L_1$ PLAD estimator has a nice variable selection property
in the noiseless case.

Theorem 3 Consider the noiseless case. Suppose we use a penalty level $\lambda$ such that
$\lambda < n\kappa_k^l(1)$; then the $L_1$ penalized LAD estimator $\hat\beta$ satisfies $\hat\beta = \beta$.

Remark 4 Suppose $\kappa_k^l(1)$ is bounded away from 0 for all $n$ and we use the penalty level
$\lambda = 2\sqrt{n\log p}$. Then when $\log p = o(n)$ and $n$ is large enough, the $L_1$ penalized LAD
estimator $\hat\beta$ satisfies $\hat\beta = \beta$.

4 Numerical Study 

In this section, we will show some numerical results. Throughout this section, we use
$n = 200$, $p = 400$ and $k = 5$ and set $\beta = (3, 3, 3, 3, 3, 0, \ldots, 0)$. We will study both the
estimation properties and variable selection properties of the $L_1$ PLAD estimator under
various noise structures. In our simulation study, we generate the design matrix $X$ by i.i.d.
$N(0, 1)$ random variables and then normalize the columns.
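
A minimal Python sketch of this simulation design follows. It assumes the column normalization $\|X_i\|_2^2 = n$ from Section 1; the function names and the random seed are hypothetical, and the paper's own simulations were run in R rather than with this code.

```python
import math
import random


def simulate_design(n, p, rng):
    """Return X as a list of n rows; each column is rescaled to squared L2 norm n."""
    cols = []
    for _ in range(p):
        col = [rng.gauss(0.0, 1.0) for _ in range(n)]
        scale = math.sqrt(n / sum(v * v for v in col))
        cols.append([v * scale for v in col])
    return [[cols[j][i] for j in range(p)] for i in range(n)]


def simulate_response(X, beta, noise):
    """Y = X beta + z for a given noise vector z."""
    return [sum(x * b for x, b in zip(row, beta)) + e for row, e in zip(X, noise)]


rng = random.Random(2024)
n, p, k = 200, 400, 5
X = simulate_design(n, p, rng)
beta = [3.0] * k + [0.0] * (p - k)
z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # N(0, 1) noise as one example
Y = simulate_response(X, beta, z)
```

Other noise structures only change how `z` is drawn; the design and the coefficient vector stay the same.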

We first investigate the effect of different choices of penalty levels. Then we compare
the $L_1$ PLAD method and the lasso method in the Gaussian noise case. We also study
the numerical properties of the $L_1$ PLAD estimator under different noise structures, including
the heteroscedastic cases. We use the quantreg package and lars package in R to run the
simulation.

4.1 Effect of Penalty levels 

Section 2 discusses the choice of penalty levels. It is known that our desired choice is
$c\,q_\alpha(\|S\|_\infty)$. But since this value is hard to calculate, we propose several upper bounds
and asymptotic choices. Now we will investigate the effect of different choices of penalty
levels on the $L_1$ PLAD estimator. To be specific, we consider the following four penalties,
$\lambda_1 = \sqrt{1.5n\log p}$, $\lambda_2 = \sqrt{2n\log p}$, $\lambda_3 = \sqrt{3n\log p}$, and $\lambda_4 = \sqrt{4n\log p}$. Note that they are
all fixed choices and do not depend on any assumptions or parameters. For noises, we use (a)
$N(0, 1)$ noise, (b) $t(2)$ noise, and (c) Cauchy noise. For each setting, we run the simulation
200 times and the average squared $L_2$ norms of the estimation errors are summarized in the
following table.

Table 1: The average of estimation error $\|\hat\beta - \beta\|_2^2$ over 200 simulations under different
penalty levels and error distributions. Numbers in the parentheses are the medians of the
estimation errors of the post $L_1$ PLAD method, i.e. results of ordinary LAD estimators on the
selected subset.

| | $\lambda_1$ | $\lambda_2$ | $\lambda_3$ | $\lambda_4$ |
|---|---|---|---|---|
| $N(0, 1)$ noise | 0.658 (0.356) | 1.054 (0.239) | 3.189 (0.095) | 23.730 (4.586) |
| $t(2)$ noise | 1.263 (0.552) | 2.351 (0.299) | 10.121 (0.081) | 33.018 (18.771) |
| Cauchy noise | 2.176 (0.861) | 4.736 (0.334) | 21.417 (0.103) | 39.351 (26.241) |

From Table 1 we can see that $\lambda_4$ is too large in our setup and it kills most of the
variables. (It is worth noting that if we increase the sample size to, for example, $n = 400$
and $p = 800$, $\lambda_4$ becomes a reasonable choice.) Moreover, a larger $\lambda$ causes more bias in the
estimator. In practice, an ordinary least squares method or least absolute deviation method
could be applied to the selected variables to correct the bias (post $L_1$ PLAD method). We
summarize the median of the ordinary LAD estimators on the selected subset in the above
table. It can be seen that among the four penalty levels, $\lambda_1$ has the best results in terms
of the estimation error $\|\hat\beta - \beta\|_2^2$, and $\lambda_3$ has the best results in terms of post $L_1$ PLAD
estimation error. The post $L_1$ PLAD results are very good for all three noise distributions
even though the $t(2)$ distribution does not have bounded variance and the Cauchy distribution
does not have bounded expectation.

4.2 Gaussian Noise 

Now consider the Gaussian noise case, i.e. the $z_i$ are independent and identically distributed normal random
variables. The standard deviation $\sigma$ of $z_i$ is varied between 0 and 3. Here we also include
the noiseless case, where the traditional lasso cannot select the model correctly. We will use
penalty level $\lambda = \sqrt{2n\log p}$ and run 200 times for each value of $\sigma$. For each simulation,
we use both the $L_1$ PLAD method and the classical lasso method. For the lasso method,
we use $\sigma \times \lambda$ as the penalty level, where we assume the standard deviation is known. In
the noiseless case, we use $0.01 \times \lambda$ as the penalty level for the lasso method. Here we
summarize the average estimation error and the variable selection results of both methods
for five different values of $\sigma$.

In Table 2, the average type I error means the average number of significant variables
that are unselected over 200 runs. The average type II error means the average number
of insignificant variables that are selected over 200 runs. The results show that in terms
of estimation, the classical lasso method does better than the $L_1$ PLAD method, except in the
noiseless case. This is partly because the lasso knows the standard deviation and $L_1$ PLAD
does not. Also, the penalty level for the $L_1$ PLAD method has a stronger shrinkage effect and
hence causes more bias.



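Table 2: The average of estimation error $\|\hat\beta - \beta\|_2^2$ over 200 replications and the variable
selection results for lasso and $L_1$ penalized LAD method.

| Value of $\sigma$ | $\sigma = 0$ | $\sigma = 0.25$ | $\sigma = 0.5$ | $\sigma = 1$ | $\sigma = 3$ |
|---|---|---|---|---|---|
| $L_1$ PLAD: Average of $\|\hat\beta - \beta\|_2^2$ | 0 | 0.065 | 0.269 | 1.057 | 8.988 |
| $L_1$ PLAD: Average type I error | 0 | 0 | 0 | 0 | 0 |
| $L_1$ PLAD: Average type II error | 0 | 0.185 | 0.150 | 0.120 | 0.175 |
| Lasso: Average of $\|\hat\beta - \beta\|_2^2$ | 11.419 | 0.062 | 0.106 | 0.344 | 3.498 |
| Lasso: Average type I error | 0 | 0 | 0 | 0 | 0 |
| Lasso: Average type II error | 24.125 | 0.825 | 0.875 | 0.710 | 0.95 |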
In terms of variable selection, the $L_1$ PLAD method does better than the classical lasso
method. The two methods both select all the significant variables in all 200 simulations.
The $L_1$ PLAD method has smaller average type II errors, which means the lasso method
tends to select more incorrect variables than the $L_1$ PLAD method. It is worth noting that
the $L_1$ PLAD method does a perfect job in the noiseless case: it selects the perfect model in every
run, while the lasso method never has a correct variable selection result.

4.3 Heavy tail and Heteroscedastic Noise 

In the proof of Theorem 1 and all the discussions, the identical distribution assumption
is not essential for our arguments. Now we will study the performance of the $L_1$ PLAD
estimator when the noises $z_i$ are just independent and not identically distributed. We will
consider three cases: (a) $z_i \sim N(0, \sigma_i^2)$, where $\sigma_i \sim U(0, 3)$ and are independent; (b) $z_i/s_i \sim
t(2)$, where $s_i \sim U(0, 3)$ and are independent; (c) with probability 1/3, $z_i \sim N(0, \sigma_i^2)$ and
$\sigma_i \sim U(0, 3)$, with probability 1/3, $z_i/s_i \sim t(2)$ and $s_i \sim U(0, 3)$, and with probability 1/3,
$z_i/s_i$ follows the exponential distribution with parameter 1 and $s_i \sim U(0, 3)$ (relocated such
that the median is 0). We use penalty $\lambda = \sqrt{2n\log p}$ for all cases. It is worth noting that in
all the cases, the traditional lasso method and the constrained minimization methods cannot
be properly applied since the variances of the noises are unbounded.
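
The three noise cases can be sketched with the Python standard library as follows. This is an illustrative sketch, not the code used for the reported results: the function names are hypothetical, the $t(2)$ draw uses the representation $N(0,1)/\sqrt{\chi^2_2/2}$ with $\chi^2_2/2 \sim \mathrm{Exp}(1)$, and the exponential component is relocated by subtracting its median $\log 2$.

```python
import math
import random
import statistics


def t2(rng):
    """Student t with 2 degrees of freedom: N(0, 1) / sqrt(chi2_2 / 2)."""
    return rng.gauss(0.0, 1.0) / math.sqrt(rng.expovariate(1.0))


def noise_case_a(rng):
    """Case (a): N(0, sigma_i^2) with sigma_i ~ U(0, 3)."""
    return rng.gauss(0.0, rng.uniform(0.0, 3.0))


def noise_case_b(rng):
    """Case (b): s_i * t(2) with s_i ~ U(0, 3)."""
    return rng.uniform(0.0, 3.0) * t2(rng)


def noise_case_c(rng):
    """Case (c): equal-weight mixture of Gaussian, t(2) and centred exponential."""
    s = rng.uniform(0.0, 3.0)
    component = rng.randrange(3)
    if component == 0:
        return rng.gauss(0.0, s)
    if component == 1:
        return s * t2(rng)
    return s * (rng.expovariate(1.0) - math.log(2.0))  # median shifted to 0


rng = random.Random(7)
draws = 20001
medians = {
    name: statistics.median(gen(rng) for _ in range(draws))
    for name, gen in (("a", noise_case_a), ("b", noise_case_b), ("c", noise_case_c))
}
```

In each case the sample median is close to 0, which is the only distributional requirement on the noise used by the penalty rule.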

Table 3 summarizes the average estimation errors and variable selection properties of the
$L_1$ PLAD method over 200 runs. We also summarize the estimation errors of the post $L_1$
PLAD method in the parentheses. It can be seen that the $L_1$ PLAD method has very nice
estimation and variable selection properties for all cases. Comparing the variable selection
results here with the Gaussian noise case in Table 2, we can see that although we have
many different noise structures, the $L_1$ PLAD method can always select a good model. Its
variable selection results here are comparable to the Gaussian noise case.






5 Proofs 

We will first show some technical lemmas and then prove the main results. 



5.1 Technical Lemmas 

We first state the Slastnikov-Rubin-Sethuraman Moderate Deviation Theorem. Let $X_{ni}$, $i =
1, \ldots, k_n$; $n \ge 1$ be a double sequence of row-wise independent random variables with
$E(X_{ni}) = 0$, $E(X_{ni}^2) < \infty$, $i = 1, \ldots, k_n$; $n \ge 1$, and $B_n^2 = \sum_{i=1}^{k_n} E(X_{ni}^2) \to \infty$ as $n \to \infty$.
Let $F_n(x) = P\left(\sum_{i=1}^{k_n} X_{ni} < xB_n\right)$. We have

Lemma 6 (Slastnikov, Theorem 1.1) If for sufficiently large $n$ and some positive constant
$c$,

$$\sum_{i=1}^{k_n} E\left[|X_{ni}|^{2+c^2}\rho(|X_{ni}|)\log^{-(1+c^2)/2}(3 + |X_{ni}|)\right] \le g(B_n)B_n^2,$$

where $\rho(t)$ is a slowly varying function monotonically growing to infinity and $g(t) = o(\rho(t))$
as $t \to \infty$, then

$$1 - F_n(x) \sim 1 - \Phi(x), \quad F_n(-x) \sim \Phi(-x), \quad n \to \infty,$$

uniformly in the region $0 \le x \le c\sqrt{\log B_n^2}$.



Table 3: The average of estimation error $\|\hat\beta - \beta\|_2^2$ over 200 replications and the variable
selection results for the $L_1$ PLAD method. Numbers in the parentheses are the medians of
the estimation errors of the post $L_1$ PLAD method.

| | Case (a) | Case (b) | Case (c) |
|---|---|---|---|
| Average of $\|\hat\beta - \beta\|_2^2$ | 2.141 (0.253) | 4.355 (0.269) | 2.108 (0.218) |
| Average type I error | 0 | 0 | 0.005 |
| Average type II error | 0.145 | 0.155 | 0.16 |
Corollary 1 (Slastnikov, Rubin-Sethuraman) If $q > c^2 + 2$ and

$$\sum_{i=1}^{k_n} E\left[|X_{ni}|^q\right] \le KB_n^2,$$

then there is a sequence $\gamma_n \to 1$, such that

$$\left|\frac{1 - F_n(x) + F_n(-x)}{2(1 - \Phi(x))} - 1\right| \le \gamma_n - 1 \to 0, \quad n \to \infty,$$

uniformly in the region $0 \le x \le c\sqrt{\log B_n^2}$.

Remark. Rubin-Sethuraman derived the corollary for $x = t\sqrt{\log B_n^2}$ for fixed $t$.
Slastnikov's result adds uniformity and relaxes the moment assumption. We refer to [H] for
proofs.

Next, we will state a couple of simple yet useful results. Suppose $U > 0$ is a fixed
constant. For any $x = (x_1, x_2, \ldots, x_n) \in R^n$, let

$$G(x) = \sum_{i=1}^n |x_i|(|x_i| \wedge U),$$

where $a \wedge b$ denotes the minimum of $a$ and $b$. Then we have the following results.

Lemma 7 For any $x = (x_1, x_2, \ldots, x_n) \in R^n$, we have that

$$G(x) \ge \begin{cases} \frac{U}{2}\|x\|_1 & \text{if } \|x\|_1 \ge nU/2, \\ \frac{\|x\|_1^2}{n} & \text{if } \|x\|_1 < nU/2. \end{cases}$$

Proof. Let $y = x/U$, then it is easy to see that

$$\frac{G(x)}{U^2} = \sum_{i=1}^n |y_i|(|y_i| \wedge 1).$$

We first consider the case where $\|y\|_1 \ge n/2$. Now suppose $|y_i| < 1$ for $i = 1, 2, \ldots, k$ (note
that $k$ might be 0 or $n$), and $|y_i| \ge 1$ for $i > k$. Then

$$\frac{G(x)}{U^2} = \|y\|_1 + \sum_{i=1}^k y_i^2 - \sum_{i=1}^k |y_i| \ge \|y\|_1 - \frac{n}{4} \ge \frac{\|y\|_1}{2}.$$

Now let us consider the case where $\|y\|_1 < n/2$. Suppose there exists an $i$ such that $|y_i| > 1$,
then there must be a $j$ such that $|y_j| < 1/2$. If we replace $y_i$ and $y_j$ by $y_i' = |y_i| - \epsilon \ge 1$
and $y_j' = |y_j| + \epsilon \le 1/2$ for some $\epsilon > 0$, the value of $G(x)/U^2$ decreases. This means that if
$G(x)/U^2$ is minimized, all the $y_i$ must satisfy $|y_i| \le 1$. In this case,

$$\frac{G(x)}{U^2} = \|y\|_2^2 \ge \frac{\|y\|_1^2}{n}.$$

Putting the above inequalities together, the lemma is proved. ■
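
Lemma 7 is easy to sanity-check numerically. The following Python sketch draws random vectors at several scales and counts violations of the stated lower bound; the value $U = 2$, the helper names, and the sampling scheme are arbitrary choices made for illustration.

```python
import random


def G(x, U):
    """G(x) = sum_i |x_i| * min(|x_i|, U)."""
    return sum(abs(v) * min(abs(v), U) for v in x)


def lemma7_lower_bound(x, U):
    """Lower bound of Lemma 7, depending on whether ||x||_1 >= n * U / 2."""
    n = len(x)
    l1 = sum(abs(v) for v in x)
    if l1 >= n * U / 2.0:
        return U * l1 / 2.0
    return l1 * l1 / n


rng = random.Random(1)
U = 2.0
violations = 0
for _ in range(2000):
    n = rng.randint(1, 8)
    scale = rng.choice((0.1, 1.0, 10.0))  # exercise both cases of the lemma
    x = [rng.gauss(0.0, scale) for _ in range(n)]
    if G(x, U) < lemma7_lower_bound(x, U) - 1e-9:
        violations += 1
```

No violations are expected, since the inequality holds for every vector; the small tolerance only guards against floating-point rounding.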
The following lemma is from [7].

Lemma 8 For any $x \in R^n$,

$$\|x\|_2 - \frac{\|x\|_1}{\sqrt{n}} \le \frac{\sqrt{n}}{4}\left(\max_{1 \le i \le n}|x_i| - \min_{1 \le i \le n}|x_i|\right).$$

Remark 5 An interesting consequence of the above lemma is: for any $x \in R^n$,

$$\|x\|_2 \le \frac{\|x\|_1}{\sqrt{n}} + \frac{\sqrt{n}\,\|x\|_\infty}{4}.$$
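
The inequality of Lemma 8 can be checked in the same spirit. The sketch below evaluates the slack between the right-hand and left-hand sides on random vectors and on the vector $(1, 0, 0, 0)$, for which both sides equal $1/2$; the helper name `lemma8_gap` is hypothetical.

```python
import math
import random


def lemma8_gap(x):
    """Right-hand side minus left-hand side of Lemma 8 (nonnegative if it holds)."""
    n = len(x)
    abs_x = [abs(v) for v in x]
    lhs = math.sqrt(sum(v * v for v in x)) - sum(abs_x) / math.sqrt(n)
    rhs = math.sqrt(n) / 4.0 * (max(abs_x) - min(abs_x))
    return rhs - lhs


rng = random.Random(3)
smallest_gap = min(
    lemma8_gap([rng.gauss(0.0, 1.0) for _ in range(rng.randint(1, 10))])
    for _ in range(2000)
)
tight_case_gap = lemma8_gap([1.0, 0.0, 0.0, 0.0])  # both sides equal 1/2 here
```

The tight case shows that the constant $\sqrt{n}/4$ cannot be improved in general, at least for $n = 4$.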



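5.2 Proof of Lemma 1

In this section, we will prove Lemma 1 by the union bound and Hoeffding's inequality. Firstly,
by the union bound, it can be seen that

$$P\left(c\sqrt{2A(\alpha)n\log p} \le c\|S\|_\infty\right) \le \sum_{i=1}^p P\left(\sqrt{2A(\alpha)n\log p} \le |X_i'I|\right).$$

For each $i$, by Hoeffding's inequality,

$$P\left(\sqrt{2A(\alpha)n\log p} \le |X_i'I|\right) \le 2\exp\left\{-\frac{2A(\alpha)n\log p}{2\|X_i\|_2^2}\right\} = 2p^{-A(\alpha)},$$

since $\|X_i\|_2^2 = n$ for all $i$. Therefore,

$$P\left(c\sqrt{2A(\alpha)n\log p} \le c\|S\|_\infty\right) \le p \cdot 2p^{-A(\alpha)} \le \alpha.$$

Hence the lemma is proved.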
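5.3 Proof of Lemma 2

By the union bound, it can be seen that

$$P\left(c\sqrt{n}\,\Phi^{-1}(1 - \alpha/(2p)) \le c\|S\|_\infty\right) \le \sum_{i=1}^p P\left(\sqrt{n}\,\Phi^{-1}(1 - \alpha/(2p)) \le |X_i'I|\right).$$

For each $i$, from Corollary 1,

$$P\left(\sqrt{n}\,\Phi^{-1}(1 - \alpha/(2p)) \le |X_i'I|\right) \le 2\left(1 - \Phi(\Phi^{-1}(1 - \alpha/(2p)))\right)(1 + \omega_n) = \frac{\alpha}{p}(1 + \omega_n),$$

where $\omega_n$ goes to 0 as $n$ goes to infinity, provided that $\Phi^{-1}(1 - \alpha/2p) \le (q - 2)\sqrt{\log n}$.
Hence

$$P\left(c\sqrt{n}\,\Phi^{-1}(1 - \alpha/(2p)) \le c\|S\|_\infty\right) \le \alpha(1 + \omega_n).$$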

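5.4 Proof of Lemma 5

It is easy to see that when $c \ge \frac{6}{a}$,

$$c - \frac{2}{a}\log\left(1 + \frac{a}{2}c\right) \ge c - \frac{2}{a} \cdot \frac{ac}{4} = \frac{c}{2},$$

and when $c \le \frac{6}{a}$,

$$c - \frac{2}{a}\log\left(1 + \frac{a}{2}c\right) \ge c - \frac{2}{a}\left(\frac{ac}{2} - \frac{1}{8}\left(\frac{ac}{2}\right)^2\right) = \frac{ac^2}{16}.$$

Similarly, we can show that for any real number $c$, when $|c| \ge \frac{6}{a}$,

$$E(|z_i + c| - |z_i|) \ge \frac{|c|}{2},$$

and when $|c| \le \frac{6}{a}$,

$$E(|z_i + c| - |z_i|) \ge \frac{ac^2}{16}.$$

Putting the above inequalities together, the lemma is proved.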
5.5 Proof of Lemma 3

First, it can be seen that for any $1 \le i \le n$, $\left||(Xd)_i + z_i| - |z_i|\right| \le |(Xd)_i|$. So $|(Xd)_i + z_i| - |z_i|$
is a bounded random variable for any fixed $d$. Hence for any fixed $k$ sparse signal $d \in R^p$,
by Hoeffding's inequality, we have

$$P(B(d) \ge t) \le 2\exp\left\{-\frac{t^2 n}{2\|Xd\|_2^2}\right\},$$

for all $t > 0$. From the definition of $\lambda_k^u$, we know that

$$P(B(d) \ge t) \le 2\exp\left\{-\frac{t^2}{2\lambda_k^u\|d\|_2^2}\right\}.$$

In the above inequality, let $t = C\sqrt{2k\log p}\,\|d\|_2$, we have

$$P\left(B(d) \ge C\sqrt{2k\log p}\,\|d\|_2\right) \le 2p^{-kC^2/\lambda_k^u}, \qquad (17)$$



for all C > 0. Next we will find an upper bound for sup^g^p ij^ii^^;^ We shall 

use the e-Net and covering number argument. Consider the e-Net of the set {d G R^, \\d\\o = 
k, \\d\\2 = 1}. From the standard results of covering number, see for example [3], we know 
that the covering number of {d G R'^, \\d\\2 = 1} by e balls (i.e. {y G i?'^ : \\y — x\\2 < e}) is 
at most (3/e)^' for e < 1. So the covering number of {d G R^ , \\d\\o = k, \\d\\2 = 1} by e balls 
is at most (Sp/e)'' for e < 1. Suppose N is such a e-Net of {d G R^ , ||(i||o = k, \\d\\2 = 1}. 
By union bound, 

P(sup \B{d)\ > C^2k\ogp) < 2(3/e)VV^'^^S 
for all C > 0. Moreover, it can be seen that, 

sup \B{di) - B{d2)\ < -^\\X{di - d2)\\i < 2^Kle. 

Therefore 

sup \B{d)\ < sup \B{d)\ + 2Vn4e. 

d£RP ,\\d\\o=k,\\d\\2 = l <i&^ 
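Let $\epsilon = \sqrt{\frac{k\log p}{2n\lambda_k^u}}$, so that $2\sqrt{n\lambda_k^u}\,\epsilon = \sqrt{2k\log p}$. Then we know that 

$$P\left(\sup_{d\in R^p,\|d\|_0=k,\|d\|_2=1}|B(d)| > C\sqrt{2k\log p}\right) \le P\left(\sup_{d\in N}|B(d)| > (C-1)\sqrt{2k\log p}\right) \le 2\left(\frac{3p}{\epsilon}\right)^k p^{-k(C-1)^2/\lambda_k^u}.$$

Under the assumptions on $p$ (including $p > n$), let $C = 1 + 2C_1\sqrt{\lambda_k^u}$ for some $C_1 > 1$; we 
know that 

$$P\left(\sup_{d\in R^p,\|d\|_0=k,\|d\|_2=1}|B(d)| > (1 + 2C_1\sqrt{\lambda_k^u})\sqrt{2k\log p}\right) \le 2p^{-4k(C_1^2-1)}. \quad (18)$$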

Hence the lemma is proved. 
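The first step of this proof, Hoeffding's inequality for a fixed sparse $d$, can be illustrated by simulation. The sketch below is an illustration under stated assumptions rather than the paper's experiment: it uses a Rademacher design (so every column has squared norm exactly $n$), standard Cauchy noise, and the closed form $E(|z + x| - |z|) = \frac{2x}{\pi}\arctan x - \frac{1}{\pi}\log(1 + x^2)$ for standard Cauchy $z$. All names and sizes are hypothetical.

```python
import math
import random

def cauchy_mean_shift(x):
    """E(|z + x| - |z|) for standard Cauchy z (an even function of x)."""
    x = abs(x)
    return (2.0 * x / math.pi) * math.atan(x) - math.log1p(x * x) / math.pi

def simulate_exceedance(n=100, p=20, k=3, trials=400, seed=0):
    rng = random.Random(seed)
    # Rademacher design: each column has squared Euclidean norm exactly n.
    X = [[rng.choice((-1.0, 1.0)) for _ in range(p)] for _ in range(n)]
    d = [0.0] * p
    for j in range(k):
        d[j] = 1.0 / math.sqrt(k)            # fixed k-sparse unit vector
    xd = [sum(row[j] * d[j] for j in range(k)) for row in X]
    norm_sq = sum(v * v for v in xd)         # ||Xd||_2^2
    mean_part = sum(cauchy_mean_shift(v) for v in xd)
    t = 2.0 * math.sqrt(norm_sq / n)         # makes the Hoeffding bound 2*exp(-2)
    bound = 2.0 * math.exp(-t * t * n / (2.0 * norm_sq))
    hits = 0
    for _ in range(trials):
        z = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
        s = sum(abs(v + zi) - abs(zi) for v, zi in zip(xd, z))
        B = abs(s - mean_part) / math.sqrt(n)  # centered, scaled deviation B(d)
        if B >= t:
            hits += 1
    return hits / trials, bound

freq, bound = simulate_exceedance()
```

The empirical exceedance frequency `freq` should stay below `bound` even though the noise has no finite mean, because each summand $|(Xd)_i + z_i| - |z_i|$ is bounded by $|(Xd)_i|$.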



5.6 Proof of Lemma [4] 

Since $\big|\,|z_i + x| - |z_i|\,\big| \le |x|$ is bounded, the expectation always exists. Suppose the density 
function of $z_i$ is $f(z)$ and $x > 0$. It is easy to see that 

$$E(|z_i + x| - |z_i|) = \int_0^\infty f(t)x\,dt + \int_{-x}^0 f(t)(2t + x)\,dt - \int_{-\infty}^{-x} f(t)x\,dt$$

$$= x\left(\int_{-x}^\infty f(t)\,dt - \int_{-\infty}^{-x} f(t)\,dt\right) + \int_{-x}^0 2tf(t)\,dt$$

$$= x(1 - 2P(z_i \le -x)) + \int_{-x}^0 2tf(t)\,dt.$$

Hence it is easy to see that 

$$\frac{dE(|z_i + x| - |z_i|)}{dx} = 1 - 2P(z_i \le -x).$$
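As a worked instance of this derivative formula, take $z_i$ standard Cauchy (an assumption made here only for illustration). Then $P(z_i \le -x) = 1/2 - \arctan(x)/\pi$ and $\int_{-x}^0 2tf(t)\,dt = -\log(1 + x^2)/\pi$, so the expectation has a closed form whose derivative can be compared with $1 - 2P(z_i \le -x)$ by finite differences:

```python
import math

def mean_shift_cauchy(x):
    """E(|z + x| - |z|) for standard Cauchy z and x > 0, i.e.
    x * (1 - 2 P(z <= -x)) + int_{-x}^{0} 2 t f(t) dt."""
    return (2.0 * x / math.pi) * math.atan(x) - math.log1p(x * x) / math.pi

def derivative_formula(x):
    """1 - 2 P(z <= -x), with P(z <= -x) = 1/2 - atan(x)/pi."""
    return 1.0 - 2.0 * (0.5 - math.atan(x) / math.pi)

step = 1e-5
max_err = 0.0
for x in (0.1, 0.5, 1.0, 3.0, 10.0):
    # Central finite difference of the closed-form expectation.
    finite_diff = (mean_shift_cauchy(x + step) - mean_shift_cauchy(x - step)) / (2.0 * step)
    max_err = max(max_err, abs(finite_diff - derivative_formula(x)))
```

The agreement reflects the cancellation of the two $2xf(-x)$ terms that arise when the product $x(1 - 2P(z_i \le -x))$ and the integral are differentiated.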

5.7 Proof of Theorem [1] and [3] 

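Now we will bound the estimation error of the $L_1$ penalized LAD estimator. Recall that 
$h = \beta - \hat\beta$ and $h \in \Delta_C = \{\delta \in R^p : \|\delta_T\|_1 \ge C\|\delta_{T^c}\|_1\}$. Without loss of generality, 
assume $|h_1| \ge |h_2| \ge \cdots \ge |h_p|$. Let $S_0 = \{1, 2, \cdots, k\}$; we have $\|h_{S_0}\|_1 \ge C\|h_{S_0^c}\|_1$. Partition 
$\{1, 2, \cdots, p\}$ into the following sets: 

$$S_0 = \{1, 2, \cdots, k\},\ S_1 = \{k+1, \cdots, 2k\},\ S_2 = \{2k+1, \cdots, 3k\},\ \cdots.$$

Then it follows from lemma [8] that 

$$\sum_{i\ge 1}\|h_{S_i}\|_2 \le \sum_{i\ge 1}\frac{\|h_{S_i}\|_1}{\sqrt{k}} + \frac{\sqrt{k}}{4}|h_{k+1}| \le \left(\frac{1}{4} + \frac{1}{C}\right)\|h_{S_0}\|_2. \quad (19)$$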

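It is easy to see that 

$$\frac{1}{\sqrt{n}}(\|Xh + z\|_1 - \|z\|_1) \ge \frac{1}{\sqrt{n}}(\|Xh_{S_0} + z\|_1 - \|z\|_1) + \sum_{i\ge 1}\frac{1}{\sqrt{n}}\Big(\Big\|X\Big(\sum_{j=0}^{i}h_{S_j}\Big) + z\Big\|_1 - \Big\|X\Big(\sum_{j=0}^{i-1}h_{S_j}\Big) + z\Big\|_1\Big). \quad (20)$$

Now for any fixed vector $d$, let 

$$M(d) = \frac{1}{\sqrt{n}}E(\|Xd + z\|_1 - \|z\|_1).$$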



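By lemma [3], we know that with probability at least $1 - 2p^{-4k(C_2^2-1)}$, 

$$\frac{1}{\sqrt{n}}(\|Xh_{S_0} + z\|_1 - \|z\|_1) \ge M(h_{S_0}) - C_1\sqrt{2k\log p}\,\|h_{S_0}\|_2,$$

and for $i \ge 1$, with probability at least $1 - 2p^{-4k(C_2^2-1)}$, 

$$\frac{1}{\sqrt{n}}\Big(\Big\|X\Big(\sum_{j=0}^{i}h_{S_j}\Big) + z\Big\|_1 - \Big\|X\Big(\sum_{j=0}^{i-1}h_{S_j}\Big) + z\Big\|_1\Big) \ge M(h_{S_i}) - C_1\sqrt{2k\log p}\,\|h_{S_i}\|_2,$$

where $C_1 = 1 + 2C_2\sqrt{\lambda_k^u}$ and $C_2 > 1$ is a constant. Putting the above inequalities together, we 
know that with probability at least $1 - 2p^{-4k(C_2^2-1)+1}$, 

$$\frac{1}{\sqrt{n}}(\|Xh + z\|_1 - \|z\|_1) \ge M(h) - C_1\sqrt{2k\log p}\sum_{i\ge 0}\|h_{S_i}\|_2. \quad (21)$$

By this and inequalities (8) and (19), we have that with probability at least $1 - 2p^{-4k(C_2^2-1)+1}$, 

$$M(h) \le \frac{\lambda}{\sqrt{n}}\|h_{S_0}\|_1 + C_1\sqrt{2k\log p}\left(1.25 + \frac{1}{C}\right)\|h_{S_0}\|_2. \quad (22)$$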
Next, we consider two cases. First, if $\|Xh\|_1 \ge 3n/a$, then from lemma [7] and inequality (13), 

$$\frac{1}{\sqrt{n}}E(\|Xh + z\|_1 - \|z\|_1) \ge \frac{3}{16\sqrt{n}}\|Xh\|_1 \ge \frac{3\sqrt{n}}{16}\kappa_k^l(C)\|h_{S_0}\|_2. \quad (23)$$

From assumption (14), we must have $\|h_{S_0}\|_2 = 0$ and hence $\hat\beta = \beta$. 

On the other hand, if $\|Xh\|_1 < 3n/a$, from lemma [7] and inequality (13), 

$$\frac{1}{\sqrt{n}}E(\|Xh + z\|_1 - \|z\|_1) \ge \frac{a}{16\sqrt{n}}\|Xh\|_2^2 \ge \frac{a\sqrt{n}}{16}\eta_k^l\|h_{S_0}\|_2^2. \quad (24)$$

Hence by (22), we know that with probability at least $1 - 2p^{-4k(C_2^2-1)+1}$, 

$$\|h_{S_0}\|_2 \le \frac{16\lambda\sqrt{k}}{na\eta_k^l} + \sqrt{\frac{2k\log p}{n}}\,\frac{16C_1(1.25 + 1/C)}{a\eta_k^l}.$$

In particular, when $\lambda = 2c\sqrt{n\log p}$, putting the above discussion together, we have 

$$\|h_{S_0}\|_2 \le \sqrt{\frac{2k\log p}{n}}\,\frac{16(c\sqrt{2} + 1.25C_1 + C_1/C)}{a\eta_k^l}.$$

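Since 

$$\sum_{i\ge 1}\|h_{S_i}\|_2^2 \le |h_{k+1}|\sum_{i\ge 1}\|h_{S_i}\|_1 \le \frac{1}{C}\|h_{S_0}\|_2^2,$$

we know that with probability at least $1 - 2p^{-4k(C_2^2-1)+1}$, 

$$\|h\|_2 \le \sqrt{\frac{2k\log p}{n}}\,\frac{16(c\sqrt{2} + 1.25C_1 + C_1/C)}{a\eta_k^l}\sqrt{1 + \frac{1}{C}},$$

where $C_1 = 1 + 2C_2\sqrt{\lambda_k^u}$ and $C_2 > 1$ is a constant. 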
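The block inequality $\sum_{i\ge 1}\|h_{S_i}\|_2^2 \le |h_{k+1}|\sum_{i\ge 1}\|h_{S_i}\|_1 \le \frac{1}{C}\|h_{S_0}\|_2^2$ used in this last step relies only on the ordering of the coordinates and on the cone condition $\|h_{S_0}\|_1 \ge C\|h_{S_0^c}\|_1$. The sketch below checks it on random vectors, taking $C$ as the largest constant for which the cone condition holds. The helper name and the test sizes are assumptions for illustration.

```python
import random

def block_chain(h, k):
    """Return the three terms of the chain for blocks of size k, with
    C = ||h_S0||_1 / ||h_S0^c||_1 (requires len(h) > k and a nonzero tail)."""
    a = sorted((abs(v) for v in h), reverse=True)   # |h_1| >= |h_2| >= ...
    head, tail = a[:k], a[k:]
    tail_sq = sum(v * v for v in tail)              # sum_{i>=1} ||h_Si||_2^2
    tail_l1 = sum(tail)                             # sum_{i>=1} ||h_Si||_1
    h_k1 = tail[0]                                  # |h_{k+1}|
    C = sum(head) / tail_l1
    head_sq = sum(v * v for v in head)              # ||h_S0||_2^2
    return tail_sq, h_k1 * tail_l1, head_sq / C

rng = random.Random(1)
ok = True
for _ in range(1000):
    k = rng.randint(1, 5)
    p = rng.randint(k + 1, 40)
    h = [rng.gauss(0.0, 1.0) for _ in range(p)]
    first, middle, last = block_chain(h, k)
    ok = ok and first <= middle + 1e-12 and middle <= last + 1e-12
```

Adding $\|h_{S_0}\|_2^2$ to both ends of the chain gives $\|h\|_2^2 \le (1 + 1/C)\|h_{S_0}\|_2^2$, which is where the factor $\sqrt{1 + 1/C}$ comes from.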

The proof of Theorem [3] is simple. In the noiseless case, we know that 

$$\|Xh\|_1 \le \lambda(\|h_T\|_1 - \|h_{T^c}\|_1).$$

This means $\|h_T\|_1 \ge \|h_{T^c}\|_1$ and hence $h \in \Delta_1$. So 

$$\|Xh\|_1 \ge n\kappa_k^l(1)\|h_T\|_1.$$

Since we assume that $n\kappa_k^l(1) > \lambda$, we must have $\|h\|_1 = 0$. Therefore $\hat\beta = \beta$. 